Personalized chatbots aim to give a chatbot a consistent personality so that it behaves like a real user and can act as a personal assistant. Previous studies have explored generating implicit user profiles from the user's dialogue history for building personalized chatbots. However, these studies train the entire model with only the response generation loss, which makes them prone to data sparsity. They also overemphasize the quality of the final generated response while ignoring the correlations and fusions within the user's dialogue history, leading to coarse data representations and degraded performance. To tackle these problems, we propose MCP, a self-supervised learning framework that captures better representations from users' dialogue history for personalized chatbots. Specifically, we apply contrastive sampling methods to exploit the supervision signals hidden in user dialogue history and generate pre-training samples for enhancing the model. We design three pre-training tasks based on three types of contrastive pairs from user dialogue history: response pairs, sequence augmentation pairs, and user pairs. We pre-train the utterance encoder and the history encoder with these contrastive objectives and use the pre-trained encoders to generate user profiles during personalized response generation. Experimental results on two real-world datasets show that MCP significantly outperforms existing methods.
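To make the idea of a contrastive objective concrete, here is a minimal sketch of a generic InfoNCE-style loss for a single anchor. It is an illustration only: the abstract does not give MCP's exact loss, and the function name `info_nce`, the cosine similarity, and the `temperature` value are all assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (assumed form, not MCP's own).

    The loss is the negative log-softmax score of the positive among
    the positive and all negatives, using cosine similarity.
    """
    a = normalize(anchor)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(a, c) / temperature for c in candidates]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]


# A positive close to the anchor gives a lower loss than a distant one.
aligned = info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

In a setup like the one described, the anchor and positive would be encoder outputs for a contrastive pair (for example two responses, two augmented sequences, or two users), but how those pairs are built is not specified here.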
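Recent years have seen great progress in applying pre-trained language models, such as BERT, to information retrieval (IR) tasks. Hyperlinks, which are commonly used in web pages, have been leveraged for designing pre-training objectives. For example, the anchor texts of hyperlinks have been used to simulate queries, yielding a very large number of query-document pairs for pre-training. However, as a bridge between two web pages, the potential of hyperlinks has not been fully explored. In this work, we focus on modeling the relationship between two documents connected by hyperlinks and design a new pre-training objective for ad-hoc retrieval. Specifically, we categorize the relationships between documents into four groups: no link, unidirectional link, symmetric link, and the most relevant symmetric link. By comparing two documents sampled from adjacent groups, the model gradually improves its ability to capture matching signals. We propose a progressive hyperlink prediction (PHP) framework to explore the use of hyperlinks in pre-training. Experimental results on two large-scale ad-hoc retrieval datasets and six question-answering datasets demonstrate its superiority over existing pre-training methods.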
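The four link groups and the comparison of documents from adjacent groups can be sketched as follows. This is a hedged illustration, not the paper's implementation: the function names, the `links` set of `(source, target)` pairs, and the `best_partner` mapping (standing in for whatever relevance judgement selects the "most relevant" symmetric link) are all hypothetical, and a link in either direction is assumed to count as unidirectional.

```python
import random


def link_group(a, b, links, best_partner=None):
    """Return 0 (no link), 1 (one-way link), 2 (symmetric link),
    or 3 (most relevant symmetric link) for documents a and b.

    links: set of (source, target) hyperlink pairs.
    best_partner: hypothetical dict mapping a document to the symmetric
    neighbour judged most relevant to it; how that judgement is made
    is left unspecified.
    """
    forward, backward = (a, b) in links, (b, a) in links
    if forward and backward:
        if best_partner and best_partner.get(a) == b:
            return 3
        return 2
    if forward or backward:
        return 1
    return 0


def sample_adjacent_pair(anchor, candidates, links, best_partner=None, rng=random):
    """Sample (weaker, stronger) documents from two adjacent groups."""
    by_group = {}
    for doc in candidates:
        group = link_group(anchor, doc, links, best_partner)
        by_group.setdefault(group, []).append(doc)
    # Keep only groups whose next-higher group is also populated.
    adjacent = [g for g in by_group if g + 1 in by_group]
    if not adjacent:
        return None
    g = rng.choice(adjacent)
    return rng.choice(by_group[g]), rng.choice(by_group[g + 1])
```

Each sampled pair could then feed a pairwise objective in which the document from the higher group should score above the one from the lower group.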
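Contextual information in search sessions is important for capturing users' search intents. Various approaches have been proposed to model user behavior sequences in order to improve document ranking within a session. Typically, training samples of (search context, document) pairs are sampled randomly in each training epoch. In reality, the difficulty of understanding a user's search intent and judging a document's relevance varies greatly from one search context to another. Mixing training samples of different difficulties may confuse the model's optimization process. In this work, we propose a curriculum learning framework for context-aware document ranking, in which the ranking model learns matching signals between the search context and candidate documents in an easy-to-hard manner. In this way, we aim to guide the model gradually toward a global optimum. To leverage both positive and negative examples, two curricula are designed. Experiments on two real query log datasets show that our proposed framework can significantly improve the performance of several existing methods, demonstrating the effectiveness of curriculum learning for context-aware document ranking.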
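An easy-to-hard schedule of this kind can be sketched with a simple pacing function. The sketch below is an assumption-laden illustration: the `difficulty` scoring function, the linear pacing, and the `start_fraction` parameter are hypothetical, and the paper's two separate curricula for positive and negative examples are not reproduced.

```python
def curriculum_subset(samples, difficulty, epoch, total_epochs, start_fraction=0.3):
    """Return the easiest portion of the samples available at an epoch.

    samples: training items, e.g. (search context, document) pairs.
    difficulty: hypothetical function mapping a sample to a score,
        where a lower score means an easier sample.
    The available fraction grows linearly from start_fraction at the
    first epoch to 1.0 at the last epoch (an assumed pacing rule).
    """
    ranked = sorted(samples, key=difficulty)
    progress = min(1.0, epoch / max(1, total_epochs - 1))
    fraction = start_fraction + (1.0 - start_fraction) * progress
    keep = max(1, round(fraction * len(ranked)))
    return ranked[:keep]


# Toy usage: ten samples whose difficulty equals their value.
toy = list(range(10))
sizes = [len(curriculum_subset(toy, lambda s: s, e, 5)) for e in range(5)]
```

Training would then draw batches only from the returned subset in each epoch, so hard samples enter late.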
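Generalized text representations are the foundation of many natural language understanding tasks. To make full use of different corpora, a model inevitably needs to understand the relevance among them. However, many methods ignore this relevance and directly apply a single-channel model (a coarse paradigm) to all tasks, which lacks sufficient rationality and interpretability. In addition, some existing works learn downstream tasks by stitching together skill blocks (a fine paradigm), which may introduce redundancy and noise and thus lead to irrational results. In this work, we first analyze task correlation from three different perspectives, namely data properties, manual design, and model-based relevance, and group similar tasks together on that basis. We then propose a hierarchical framework with a coarse-to-fine paradigm, in which the bottom level is shared by all tasks, the middle level is divided into different groups, and the top level is assigned to each individual task. This allows our model to learn basic language properties from all tasks, improve performance on related tasks, and reduce the negative impact of unrelated tasks. Our experiments on 13 benchmark datasets across five natural language understanding tasks demonstrate the superiority of our method.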
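Beyond topical relevance, passage ranking for open-domain factoid question answering also requires a passage to contain an answer (answerability). Although a few recent studies have incorporated some reading capability into the ranker to account for answerability, the ranker is still hindered by the noisy nature of the training data typically available in this area, which treats any passage containing an answer entity as a positive sample. However, the answer entity in a passage is not necessarily mentioned in relation to the given question. To address this problem, we propose PReGAN, an approach to passage reranking based on generative adversarial neural networks, which incorporates a discriminator on answerability in addition to a discriminator on topical relevance. The goal is to force the generator to rank higher those passages that are topically relevant and contain an answer. Experiments on five public datasets show that PReGAN ranks appropriate passages better, which in turn improves the effectiveness of QA systems, and that it outperforms existing approaches without using external data.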
This paper presents a novel framework for planning in unknown and occluded urban spaces. We specifically focus on turns and intersections where occlusions significantly impact navigability. Our approach uses an inpainting model to fill in a sparse, occluded, semantic lidar point cloud and plans dynamically feasible paths for a vehicle to traverse through the open and inpainted spaces. We demonstrate our approach using a car's lidar data with real-time occlusions, and show that by inpainting occluded areas, we can plan longer paths with more turn options than without inpainting. In addition, our approach follows paths derived from a planner with no occlusions (called the ground truth) more closely than other state-of-the-art approaches.
Large pretrained language models have shown surprising In-Context Learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without additional parameter updates. Despite the great success in performance, the working mechanism of ICL remains an open problem. To better understand how ICL works, this paper explains language models as meta-optimizers and understands ICL as a kind of implicit finetuning. Theoretically, we show that Transformer attention has a dual form of gradient-descent-based optimization. Building on this, we understand ICL as follows: GPT first produces meta-gradients according to the demonstration examples, and then these meta-gradients are applied to the original GPT to build an ICL model. Experimentally, we comprehensively compare the behavior of ICL and explicit finetuning on real tasks to provide empirical evidence that supports our understanding. The results show that ICL behaves similarly to explicit finetuning at the prediction level, the representation level, and the attention behavior level. Further, inspired by our understanding of meta-optimization, we design a momentum-based attention by analogy with the momentum-based gradient descent algorithm. Its consistently better performance over vanilla attention supports our understanding from another angle and, more importantly, shows the potential of using this understanding in future model design.
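The duality mentioned above can be checked numerically in its simplest setting. The sketch below assumes the softmax-free (linear attention) case and a single linear layer, which is a simplification and not the paper's full derivation: a weight update built from outer products, delta_W = sum_i e_i x_i^T, applied to a query q gives the same result as unnormalised attention with values e_i, keys x_i, and query q. All names are illustrative.

```python
def outer_sum(errors, inputs):
    """Accumulate delta_W = sum_i outer(e_i, x_i) as a list of rows."""
    rows, cols = len(errors[0]), len(inputs[0])
    delta = [[0.0] * cols for _ in range(rows)]
    for e, x in zip(errors, inputs):
        for r in range(rows):
            for c in range(cols):
                delta[r][c] += e[r] * x[c]
    return delta


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def linear_attention(values, keys, query):
    """Unnormalised, softmax-free attention: sum_i v_i * (k_i . q)."""
    out = [0.0] * len(values[0])
    for v, k in zip(values, keys):
        weight = sum(a * b for a, b in zip(k, query))
        for r in range(len(out)):
            out[r] += weight * v[r]
    return out


# Toy data: two "training" inputs with their error signals, one query.
errors = [[1.0, -2.0], [0.5, 3.0]]
inputs = [[1.0, 0.0, 2.0], [-1.0, 4.0, 0.5]]
query = [2.0, 1.0, -1.0]

via_update = matvec(outer_sum(errors, inputs), query)
via_attention = linear_attention(errors, inputs, query)
```

Both routes compute the same vector, which is the sense in which attending over demonstrations can be read as applying an implicit weight update. Standard softmax attention only approximately fits this picture.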
Position modeling plays a critical role in Transformers. In this paper, we focus on length extrapolation, i.e., training on short texts while evaluating on longer sequences. We define attention resolution as an indicator of extrapolation. We then propose two designs to improve this metric for Transformers. Specifically, we introduce a relative position embedding to explicitly maximize attention resolution. Moreover, we use blockwise causal attention during inference for better resolution. We evaluate different Transformer variants on language modeling. Experimental results show that our model achieves strong performance in both interpolation and extrapolation settings. The code will be available at https://aka.ms/LeX-Transformer.
Large language models have exhibited intriguing in-context learning capability, achieving promising zero- and few-shot performance without updating the parameters. However, conventional in-context learning is usually restricted by length constraints, which prevents it from absorbing supervision from a large number of examples. To go beyond few shots, we introduce structured prompting, which breaks the length limit and scales in-context learning to thousands of examples. Specifically, demonstration examples are separately encoded with well-designed position embeddings, and then they are jointly attended by the test example using a rescaled attention mechanism. As a result, the number of exemplars scales with linear rather than quadratic complexity with respect to length. Experimental results on a diverse set of tasks show that our approach improves end-task performance and reduces evaluation variance over conventional in-context learning as the number of demonstration examples increases. Code has been released at https://aka.ms/structured-prompting.
Choosing an appropriate scale is crucial for building an effective and informative representation of a complex system. Scientists carefully choose the scales of their experiments to extract the variables that describe the causal relationships in the system. They have found that the coarse (macro) scale is sometimes more causal and informative than the many-parameter (micro) observations. The phenomenon in which causality emerges through coarse-graining is called Causal Emergence (CE). Based on information theory, a number of recent works have shown quantitatively that CE does occur when a micro model is coarse-grained into a macro model. However, existing works have not discussed why and when CE happens. We quantitatively analyze the redistribution of uncertainties under coarse-graining and suggest that this redistribution is the cause of causal emergence. We further analyze the thresholds that determine whether CE happens. From the regularity of the transition probability matrix (TPM) of discrete systems, we derive mathematical expressions for the model properties and compute the threshold values for different operations. The results provide critical and specific conditions for CE, which serve as guidance for choosing a proper coarse-graining operation. They also provide a new way to better understand the nature of causality and causal emergence.
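A small numerical example may help show what it means for a macro model to be more causally informative than a micro model. The abstract does not name its measure, so the sketch below assumes effective information (EI), a measure commonly used in the causal emergence literature: the mean KL divergence, in bits, between each row of the TPM and the average row. The toy matrices and the grouping of micro states are illustrative choices, not taken from the paper.

```python
import math


def effective_information(tpm):
    """EI in bits: mean KL divergence of each row from the average row.

    This corresponds to intervening on the current state with a
    uniform distribution (an assumed, commonly used definition).
    """
    n = len(tpm)
    mean_row = [sum(row[j] for row in tpm) / n for j in range(len(tpm[0]))]
    total = 0.0
    for row in tpm:
        # Terms with p == 0 contribute nothing to the KL divergence.
        total += sum(p * math.log2(p / q) for p, q in zip(row, mean_row) if p > 0)
    return total / n


# Micro model: states 0-2 move uniformly among themselves, state 3 stays put.
micro = [
    [1 / 3, 1 / 3, 1 / 3, 0.0],
    [1 / 3, 1 / 3, 1 / 3, 0.0],
    [1 / 3, 1 / 3, 1 / 3, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Macro model: states {0, 1, 2} merged into one state, state 3 kept as the other.
macro = [[1.0, 0.0], [0.0, 1.0]]

ei_micro = effective_information(micro)
ei_macro = effective_information(macro)
```

Under this assumed measure the macro model scores higher than the micro model, because merging states 0 to 2 removes the uncertainty about which of them comes next.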